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Abstract 

A nonparametric kernel-based method for realizing Bayes' rule is pro- 
posed, based on representations of probabilities in reproducing kernel 
Hilbert spaces. Probabilities are uniquely characterized by the mean of 
the canonical map to the RKHS. The prior and conditional probabilities 
are expressed in terms of RKHS functions of an empirical sample: no 
explicit parametric model is needed for these quantities. The posterior 
is likewise an RKHS mean of a weighted sample. The estimator for the 
expectation of a function of the posterior is derived, and rates of consis- 
tency are shown. Some representative applications of the kernel Bayes' 
rule are presented, including Baysian computation without likelihood and 
filtering with a nonparametric state-space model. 



1 Introduction 

Kernel methods have long provided powerful tools for generalizing linear sta- 
tistical approaches to nonlinear settings, through an embedding of the sample 
to a high dimensional feature space, namely a reproducing kernel Hilbert space 
(RKHS) [18, 28]. Examples include support vector machines, kernel PCA, and 
kernel CCA, among others. In these cases, data are mapped via a canoni- 
cal feature map to a reproducing kernel Hilbert space (of high or even infinite 
dimension), in which the linear operations that define the algorithms are imple- 
mented. The inner product between feature mappings need never be computed 
explicitly, but is given by a positive definite kernel function unique to the RKHS: 
this permits efficient computation without the need to deal explicitly with the 
feature representation. 

The mappings of individual points to a feature space may be generalized to 
mappings of probability measures [e.g. 3, Chapter 4]. We call such mappings the 
kernel means of the underlying random variables. With an appropriate choice 
of positive definite kernel, the kernel mean on the RKHS uniquely determines 
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the distribution of the variable [10, 11, 35], and statistical inference problems on 
distributions can be solved via operations on the kernel means. Applications of 
this approach include homogeneity testing [14, 15], where the empirical means 
on the RKHS are compared directly, and independence testing [16, 17], where 
the mean of the joint distribution on the feature space is compared with that of 
the product of the marginals. Representations of conditional dependence may 
also be defined in RKHS, and have been used in conditional independence tests 
[13]. 

In this paper, we propose a novel, nonparametric approach to Bayesian in- 
ference, making use of kernel means of probabilities. In applying Bayes' rule, 
we compute the posterior probability of x in X given observation y in y; 

p(y\x)n(x) 

q{xly) = -ImT' (1) 

where w(x) and p(y\x) are the density functions of the prior and the likelihood 
of y given x, respectively, with respective base measures vx and vy, and the 
normalization factor qy(y) is given by 

Qy(y) = J ' p(y\x)ir(x)dv x {x). (2) 

Our main result is a nonparametric estimate of the kernel mean posterior, given 
kernel mean representations of the prior and likelihood. 

A valuable property of the kernel Bayes' rule is that the kernel posterior 
mean is estimated nonparametrically from data; specifically, the prior and the 
likelihood are represented in the form of samples from the prior and the joint 
probability that gives the likelihood, respectively. This confers an important 
benefit: we can still perform Bayesian inference by making sufficient observa- 
tions on the system, even in the absence of a specific parametric model of the 
relation between variables. More generally, if we can sample from the model, 
we do not require explicit density functions for inference. Such situations are 
typically seen when the prior or likelihood is given by a random process: Ap- 
proximate Bayesian Computation [23, 29, 39] is widely applied in population 
genetics, where the likelihood is given by a branching process, and nonparamet- 
ric Bayesian inference [26] often uses a process prior with sampling methods. 
Alternatively, a parametric model may be known, however it might be of suffi- 
cient complexity to require Markov chain Monte Carlo or sequential Monte Carlo 
for inference. The present kernel approach provides an alternative strategy for 
Bayesian inference in these settings. Wc demonstrate rates of consistency for our 
posterior kernel mean estimate, and for the expectation of functions computed 
using this estimate. 

An alternative to the kernel mean representation would be to use nonpara- 
metric density estimates for the posterior. Classical approaches include kernel 
density estimation (KDE) or distribution estimation on a finite partition of 
the domain. These methods are known to perform poorly on high dimensional 
data, however. By contrast, the proposed kernel mean representation is defined 



2 



as an integral or moment of the distribution, taking the form of a function in 
an RKHS. Thus, it is more akin to the characteristic function approach (see 
e.g. [20]) to representing probabilities. A well conditioned empirical estimate of 
the characteristic function can be difficult to obtain, especially for conditional 
probabilities. By contrast, the kernel mean has a straightforward empirical es- 
timate, and conditioning and marginalization can be implemented easily, at a 
reasonable computational cost. 

The proposed method of realizing Bayes' rule is an extension of the approach 
used in [31] for state-space models. In this earlier work, a heuristic approxima- 
tion was used, where the kernel mean of the new hidden state was estimated 
by adding kernel mean estimates from the previous hidden state and the ob- 
servation. Another relevant work is the belief propagation approach in [32, 34], 
which covers the simpler case of a uniform prior. 

This paper is organized as follows. We begin in Section 2 with a review of 
RKHS terminology and of kernel mean embeddings. In Section 3, we derive 
an expression for Bayes' rule in terms of kernel means, and provide consistency 
guarantees. We apply the kernel Bayes' rule in Section 4 to various inference 
problems, with numerical results and comparisons with existing methods in Sec- 
tion 5. Our proofs are contained in Section 6 (including proofs of the consistency 
results of Section 3). 

2 Preliminaries: positive definite kernel and prob- 
abilities 

Throughout this paper, all Hilbert spaces are assumed to be separable. For an 
operator A on a Hilbert space, the range is denoted by 7Z(A). The linear hull 
of a subset S in a vector space is denoted by SpanS". 

We begin with a review of positive definite kernels, and of statistics on the 
associated reproducing kernel Hilbert spaces [1, 3, 10, 11]. Given a set 51, a (R- 
valued) positive definite kernel k on 51 is a symmetric kernel k : 51 x 51 — >• K such 
that J2ij=i c i c jk{xi,Xj) > for arbitrary number of points x\, . . . , x n in 51 and 
real numbers c\, . . . , c n . The matrix (k(xi, £j))"j=i is called a Gram matrix. It 
is known by the Moore- Aronszajn theorem [1] that a positive definite kernel on 
51 uniquely defines a Hilbert space % consisting of functions on 51 such that (i) 
k(-,x) £ H for any x £ 51, (ii) Span{fc(-,x) | x £ 51} is dense in H, and (iii) 
(/, fc(-, x)) = f(x) for any x £ 51 and / £ H (the reproducing property), where 
(■, ■) is the inner product of H. The Hilbert space H is called the reproducing 
kernel Hilbert space (RKHS) associated with k, since the function k x = k( ,x) 
serves as the reproducing kernel (/, k x ) = f(x) for / £ %. 

A positive definite kernel on 51 is said to be bounded if there is M > such 
that k(x, x) < M for any x £ 51. 

Let {X,Bx) be a measurable space, X be a random variable taking values 
in X with distribution Px, and k be a measurable positive definite kernel on 
X such that E[y/k(X,X)] < oo. The associated RKHS is denoted by H. The 
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kernel mean m x (also written m Px ) of X on the RKHS H is denned by the 
mean of the H-valued random variable k(-, X). The existence of the kernel mean 
is guaranteed by i?[|jfc(-,A)||] = E[y/k(X,X)] < oo. We usually write mx for 
m\ for simplicity, where there is no ambiguity. By the reproducing property, 
the kernel mean satisfies the relation 



{f,m x )=E[f(X)} (3) 
for any / € %. Plugging / = k(-,u) into this relation derives 

m x (u) = E[k{u,X)] = J k(u,x)dP x {x), (4) 

which shows the explicit functional form. The kernel mean mx is also denoted 
by mp x , as it depends only on the distribution Px with k fixed. 

Let (X, Bx) and (y, By) be measurable spaces, (X, Y) be a random variable 
on X x y with distribution P, and kx and ky be measurable positive definite 
kernels with respective RKHS Hx and Hy such that E[kx(X,X)] < oo and 
E[ky(Y,Y)} < oo. The (uncentered) covariance operator Cyx ' — ^ Hy is 
defined as the linear operator that satisfies 

(g,C YX f)Hy=E[f(X)g(Y)} 

for all / G T~Lxt9 G Hy. This operator Cyx can be identified with mryx) 
in the product space Hy ®Wxi which is given by the product kernel kykx 
on y y. X [1], by the standard identification between the linear maps and the 
tensor product. We also define Cxx for the operator on Hx that satisfies 
(h,Cxxh) = E[h{X)fi{X)} for any /i,/ 2 € Hx- Similarly to Eq. (4), the 
explicit integral expressions for Cyx and Cxx are given by 



(C Y xf)(y) = J k y (y,y)f(x)dP(x,y), (C X xf)(x) = J k x (x,x)f(x)dP x (x), 

(5) 

respectively. 

An important notion in statistical inference with positive definite kernels is 
the characteristic property. A bounded measurable positive definite kernel k on a 
measurable space (51, B) is called characteristic if the mapping from a probability 
Q on (£l,B) to the kernel mean mg £ T~L is injective [11, 35]. This is equivalent 
to assuming that Ex~p[k(-, X)} = Ex'~Q\k(-, X')) implies P = Q: probabilities 
are uniquely determined by their kernel means on the associated RKHS. With 
this property, problems of statistical inference can be cast as inference on the 
kernel means. A popular example of a characteristic kernel defined on Euclidean 
space is the Gaussian RBF kernel k(x,y) = exp(— \\x — y\\ 2 /(2a 2 )). It is known 
that a bounded measurable positive definite kernel on a measurable space (fi, B) 
with corresponding RKHS H is characteristic if and only if H + K is dense in 
L 2 (P) for arbitrary probability P on (f2,B), where H + R is the direct sum of 
two RKHSs % and E [1] . This implies that the RKHS defined by a characteristic 
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kernel is rich enough to be dense in I? space up to the constant functions. Other 
useful conditions for a kernel to be characteristic can be found in [12, 35, 36]. 

Throughout this paper, when positive definite kernels on a measurable space 
are discussed, the following assumption is made: 

(K) Positive definite kernels are bounded and measurable. 

Under this assumption, the mean and covariance always exist with arbitrary 
probabilities. 

Given i.i.d. sample (X\, Yi), . . . , (X n , Y n ) with law P, the empirical estimator 
of the kernel mean and covariance operator are given straightforwardly by 

n i n 

= -J2^ X (;X i ), =-Y^ky(;Yi)®k x {;Xi), 

i=l i=l 

where Cyx is written in tensor form. It is known that these estimators are ^Jn- 
consistent in appropriate norms, and y/n(fh^ — mx) converges to a Gaussian 
process on Hx [3, Sec. 9.1]. While we may use non-i.i.d. samples for numerical 
examples in Section 5, in our theoretical analysis we always assume i.i.d. samples 
for simplicity. 

3 Kernel expression of Bayes' rule 
3.1 Kernel Bayes' rule 

Let (X,Bx) and (y,By) be measurable spaces, (X,Y) be a random variable 
on X x y with distribution P, and kx and ky be positive definite kernels 
on X and y, respectively, with respective RKHS Hx and Hy. Let II be a 
probability on (X,Bx), which serves as a prior distribution. For each x E X, 
define a probability P Y \ X on (y,By) by P Y \ X {B) = E[Tb(Y)\X = x], where Ib 
is the index function of a measurable set B € By. The prior II and the family 
{Py\x | x £ X} defines the joint distribution Q on X x y by 

Q(AxB)= [ P Ylx {B)dIL(x) 

J A 

for any A g Bx and B g By, and its marginal distribution Qy by Qy(B) = 
Q(X x B). Throughout this paper, it is assumed that Py\ x and Q are well- 
defined under some regularity conditions. Let (Z, W) be a random variable on 
X x y with distribution Q. It is also assumed that the sigma algebra generated 
by W includes every point {y} (y <G y). For y € y, the posterior probability 
given y is defined by the conditional probability 

Qx\ y (A) = E[I A (Z)\W = y] {A e Bx)- (6) 

If the probability distributions have density functions with respect to measures 
vx on X and vy on y, namely, if the p.d.f. of P and II are given by p(x, y) and 
tt(x), respectively, Eq. (6) is reduced to the well known form Eq. (1). 
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The goal of this subsection is to derive an estimator of the kernel mean of 
posterior rn,Q x \ y . The following theorem is fundamental to discuss conditional 
probabilities with positive definite kernels. 

Theorem 3.1 ([10]). If E[g(Y)\X = ■] G Hx holds for g G Hy, then 

C X xE[g(Y)\X = .}=Cxy9. 

If Cxx is injective, i.e., if the function / G Hx with Cxxf = CxYg is 
unique, the above relation can be expressed as 

E[g{Y)\X = -]=Cxx- 1 C X Yg. (7) 

Noting (Cxxf,f) — E[f(X) 2 ], it is easy to see that Cxx is injective, if X 
is a topological space, kx is a continuous kernel, and Supp(Px) — X, where 
Supp(Px) is the support of P x . 

From Theorem 3.1, we have the following result, which expresses the kernel 
mean of Qy. 

Theorem 3.2 ([31], Eq. 6). Let mn and mQ y be the kernel means of H in 
Hx and Qy in Hy, respectively. If Cxx is injective, ran G lZ(Cxx), and 
E[g(Y)\X = ■} G Ux for any g G Hy, then 

m Qy = C Y xCxx~ l mYi- (8) 

Proof. Take / € Hx such that / = C xx mn- For any g G Hy, (Cyxf,g) = 
(f,C XY g) ='(f,CxxE[g(Y)\X = •]) = (C X x f, E[g(Y)\X = •]) = (m u ,E[g(Y)\X = 
•]> = ( m Qy i ff)> which implies C Y x f = m Qy . □ 

As discussed in [31], the operator CyxC xx can be regarded as the kernel 
expression of the conditional probability Py\ x or p(y\x). 

Note, however, that the assumption E[g(Y)\X = •] G Hx may not hold in 
general; we can easily give counterexamples in the case of Gaussian kernels 1 . 
In the following, we nonetheless derive a population expression of Bayes' rule 
under this strong assumption, use it as a prototype for defining an empirical 
estimator, and prove its consistency. 

Eq. (8) has a simple interpretation if the probabilities have density functions 
and ir(x)/px{x) is in Hx, where px is the density function of the marginal Px- 
FromEq. (4) we have mn(x) = J kx{x,x)i:(x)dvx{x) = / kx(x, x)(tt(x)/px (x))dPx(x), 
which implies C xx mn = ^ jpx from Eq. (5). Thus Eq. (8) is an operator ex- 
pression of the obvious relation 




ky{y,y)p{y\x)n{x)dux{x)dvy(y) = / ky(y,y)(n(x)/p x (x))dP{x,y). 



1 Suppose that Hx and Hy are given by Gaussian kernel, and that X and Y are indepen- 
dent. Then, E[g(Y)\X = x] is a constant function of x, which is known not to be included in 
a RKHS given by a Gaussian kernel [38, Corollary 4.44]. 
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In deriving kernel realization of Bayes' rule, we will use the following tensor 
representation of the joint probability Q, based on Theorem 3.2: 

VflQ = C(YX)X C xx m n eHy®H X - (9) 

In the above equation, the covariance operator C(yx)x ■ Hx — > Hy ® Hx is 
defined by the random variable ((Y,X),X) taking values on (y x X) x X. 

In many applications of Baycsian inference, the probability conditioned on a 
particular value should be computed. By plugging the point measure at x into 
n in Eq. (8), we have a population expression 

E[ky(;Y)\X = x]=C YX C XX - 1 k X {;x), (10) 

which has been considered in [31, 32] as the kernel mean of the conditional 
probability. It must be noted that for this case the assumption run = k(-,x) € 
H(Cxx) m Theorem 3.2 may not hold in general 2 . We will show in Theorem 6.1, 
however, that under some conditions a regularized empirical estimator based on 
Eq. (10) is a consistent estimator of E[ky(-,Y)\X = x\. 
If we replace P by Q and x by y in Eq. (10), we obtain 

m Qxly =E[k x (;Z)\W = y)=CzwC^ w ky(;y). (11) 

This is exactly the kernel mean expression of the posterior, and the next step is 
to provide a way of deriving the covariance operators Czw and Cww- Recall 
that the kernel mean vtlq = raizw) G Hx 65 Hy can be identified with the 
covariance operator Czw ■ Hy — > Hx, and Tri( W w), which is the kernel mean 
on the product space Hy <S> Hy, with Cww- Then from Eq. (9) and the similar 
expression m^ww) = C^Y'Y)xC xx mYi, we are able to obtain the operators in 
Eq. (11), and thus the kernel mean of the posterior. 

The above argument can be rigorously implemented, if empirical estimators 
are considered. Let (Xi,Y\), . . . , (X n , Y n ) be an i.i.d. sample with law P. Since 
the kernel method needs to express the information of variables in terms of 
Gram matrices given by data points, we assume that the prior is also expressed 
in the form of an empirical estimate, and that we have a consistent estimator 
of mn in the form 

e 

3=1 

where U\, . . . , U(. are points in X and 7^ are the weights. The data points Uj 
may or may not be a sample from the prior II, and negative values are allowed 
for 7^. Negative values are observed in successive applications of the kernel 
Bayes rule, as in the state-space example of Section 4.3. Based on Theorem 3.2, 
the empirical estimators for m^zw) an d m(ww) are defined respectively by 

m (ZW) — L, (YX)X\ U XX ) m n > m {WW) — i ^(YY)X\ L/ XX +£ n 1 ) m U ' 

2 Suppose Cxxhx = kx(-,x) were to hold for some h x € Hx- Taking the inner product 
with kx(',S>) would then imply kx(%,%) = / h x {x')kx{x,x')dPx(sc'), which is not possible 
for many popular kernels, including the Gaussian kernel. 
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where e n is the coefficient of the Tikhonov-type regularization for operator in- 
version, and / is the identity operator. The empirical estimators Czw and Cww 
for Czw and Cww are identified with fhtzw) and fh(ww) > respectively. In the 
following, Gx and Gy denote the Gram matrices (k x (Xi, Xj)) and (ky(Yi, Yj)), 
respectively, and I n is the identity matrix of size n. 

Proposition 3.3. The Gram matrix expressions of Czw end Cww are given 
by 

n n 

Czw = ^2'fiik x {-,X i )®ky(-,Y i ) and C W w = y^ J V>iky(-,Y i ) ® ky(-,Y), 

i=l i=l 
respectively, where the common coefficient fx £ R n is 

1 \-i 1 

= (-G x + £ n In) m n , m n ,i = m n (Xi) = ^T'Yjk x (X i ,Uj). (12) 

i=i 

The proof is similar to that of Proposition 3.4 below, and is omitted. The ex- 
pressions in Proposition 3.3 imply that the probabilities Q and Qy are estimated 
by the weighted samples {((Xj, Yf), ]3j)}" =1 and {{Yi,j2i)}™ =1 , respectively, with 
common weights. Since the weight fii may be negative, in applying Eq. (11) 
the operator inversion in the form {Cww + S n I) may be impossible or unsta- 
ble. We thus use another type of Tikhonov regularization, thus obtaining the 
estimator 

™Q x \v '■= C Z w(C ww + 5„I) C W wky{-,y). (13) 

Proposition 3.4. For any y £ y, the Gram matrix expression of rfiQ x \y * s 
given by 

fh Qxly = k x R xlY k Y (y), R X{Y := AG Y ((AG Y ) 2 + S^^A, (14) 

where A = diag(/I) is a diagonal matrix with elements fii in Eq. (12), kx = 
{k x {-,X 1 ),...,k x {-,X n )) T G H X n , andky = (ky{-, Yj), ...,ky{; Y n )) T G 
Hy n . 

Proof. Let h = {C ww +8 n I)~ 1 Cwwky(-, y), and decompose it as h = Yl7=i a i^y{'i Yf)+ 
h± = a T k Y + h±, where h± is orthogonal to Span{fc^(-, F)}" =1 . Expansion of 
(C ww +5 n I)h = C W wky{-,y) gives k^(AG Y ) 2 a + 5 n k^a + 6 n hj_ = k Y Ak Y {y)- 
Taking the inner product with ky(-,Yj), we have 

{(G Y A) 2 + 5 n I n )G Y a = G Y Ak Y (y). 

The coefficient p in rriQ x \y = Czwh — Pi^x( m , Xf) is given by p = AG Y a, 

and thus 

p = A((G Y A) 2 + «5 n / n ) _1 G F Aky(y) = AG Y ((AG Y ) 2 + S^y 1 Ak Y (y). 

□ 



Input: (i) {(Xi,Yi)}f =1 : sample to express P. (ii) {(Uj, 7j)}^ = i: weighted sam- 
ple to express the kernel mean of the prior fhu- (hi) £ n , 5 n '- regularization 
constants. 

Computation: 

1. Compute Gram matrices Gx = (kx(Xi, Xj)), Gy = (ky(Yi,Yj)), 
and a vector m n = {J2j=i 7jkx(Xi, J7j))"=i- 

2. Compute p = n(Gx + ne n I n )~ 1 m.ii. 

3. Compute R x \y = AGy((AGV) 2 + <S n /„) _1 A, where A = diag(ju). 

Output: n x n matrix Rx\y ■ 

Given conditioning value y, the kernel mean of the posterior q(x\y) is esti- 
mated by the weighted sample {(Xi, pi)}f_ 1 with weight p = Rx\Y^-Y(y), 
where \c Y (y) = (ky(Yi,y))? =1 . 



Figure 1: Algorithm of Kernel Bayes' Rule 

We call Eqs.(13) and (14) the kernel Bayes' rule (KBR). The required com- 
putations are summarized in Figure 1. The KBR uses a weighted sample to 
represent the posterior; it is similar in this respect to sampling methods such 
as importance sampling and sequential Monte Carlo ([7]). The KBR method, 
however, does not generate samples of the posterior, but updates the weights of 
a sample by matrix computation. We will give some experimental comparisons 
between KBR and sampling methods in Section 5.1. 

If our aim is to estimate the expectation of a function / £ Hx with respect 
to the posterior, the reproducing property Eq. (3) gives an estimator 

(f>™Q x \v)n x = fxRx\Yk y (y), (15) 
where f x = (f{X{), f(X n )) T e R n . 

3.2 Consistency of the KBR estimator 

We now demonstrate the consistency of the KBR estimator in Eq. (15). For the 
theoretical analysis, it is assumed that the distributions have density functions 
for simplicity. In the following two theorems, we show only the best rates that 
can be derived under the assumptions, and defer more detailed discussions and 
proofs to Section 6. We assume here that the sample size £ = £ n for the prior 
goes to infinity as the sample size n for the likelihood goes to infinity, and that 
rrijj™ is n"-consistent in RKHS norm. 

Theorem 3.5. Let f be a function in Hx, (Z,W) be a random variable on 
X x y such that the distribution is Q with p.d.f. p(y\x)ir(x), and m^"' be cm 
estimator of mn such that ||m n ? "' ) — mn||-H^ = O p (n~ a ) as n oo for some 
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< a < 1/2. Assume that ir/px G TZ(C XX ), where p x is the p.d.f. of Px , and 
E[f(Z)\W = •] G 7l(Cyyw)- For the regularization constants e n = n~i a and 
5 n = n~^ a , we have for any y G y 

fTRxw^viy) - E[f{Z)\W = y] = O p ( n -^ a ), (n oo), 

where i x Rx\Y^Y (y) * s given by Eq. (15). 

It is possible to extend the covariance operator Cww to one defined on 
L 2 {Qy) by 

C ww <t>= f ky(y,w)<t>(w)dQ y (w), (cpeL 2 {Q y )). (16) 

If we consider the convergence on average over y, we have a slightly better rate 
on the consistency of the KBR estimator in L 2 (Qy). 

Theorem 3.6. Let f be a function in Wx> (Z,W) be a random vector on 
X x y such that the distribution is Q with p.d.f. p(y\x)n(x), and m^ 1 ' be an 
estimator of mn such that \\fh^ — rnn\\H x = O p (n~ a ) as n — > oo for some 

1/2 

< a < 1/2. Assume that ir/px G T^i^xx)' where px is the p.d.f. of Px, and 
E[f(Z)\W = •] G n(C^ w ). For the regularization constants e n — n 5 Q and 
5 n = n~^ a , we have 

\\$R x \ Y *Y<yr) - E[f{Z)\W]\\ LHQy) = O p (n-4«), (n ^ oo). 

1 /2 

The condition n/px G TZ{Cxx) requires the prior to be sufficiently smooth. 
If fhjj is a direct empirical mean with an i.i.d. sample of size n from II, 
typically a = 1/2, with which the theorems imply n 4 / 27 -consistency for every y, 
and n 1 / 6 -consistency in the L 2 (Qy) sense. While these might seem to be slow 
rates, the rate of convergence can in practice be much faster than the above 
theoretical guarantees. 

4 Bayesian inference with Kernel Bayes' Rule 
4.1 Applications of Kernel Bayes' Rule 

In Bayesian inference, we are usually interested in finding a point estimate such 
as the MAP solution, the expectation of a function under the posterior, or other 
properties of the distribution. Given that KBR provides a posterior estimate in 
the form of a kernel mean (which uniquely determines the distribution when a 
characteristic kernel is used), we now describe how our kernel approach applies 
to problems in Bayesian inference. 

First, we have already seen that a consistent estimator for the expectation of 
/ G T-Lx can be defined with respect to the posterior. On the other hand, unless 
/ G %x holds, there is no theoretical guarantee that it gives a good estimate. 
In Section 5.1, we discuss some experimental results in such situations. 
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To obtain a point estimate of the posterior on x, it is proposed in [31] to 
use the preimage x = argmin^ \\kx(-,x) — \s. x Rx\Y^Y(y)\\ 2 n x , which represents 
the posterior mean most effectively by one point. We use this approach in the 
present paper when point estimates are considered. In the case of the Gaussian 
kernel exp(— \\x — y\\ 2 / (2a 2 )), the fixed point method 

(t+i) = Eti^P^M-\\X i -xW\\y(2a 2 )) 
EtiPieM-\\Xi-xW\ 2 /(2<r 2 )) ' 

where p = Rx\y^-y(v), can be used to optimize x sequentially [24]. This method 
usually converges very fast, although no theoretical guarantee exists for the con- 
vergence to the globally optimal point, as is usual in non-convex optimization. 

A notable property of KBR is that the prior and likelihood are represented in 
terms of samples. Thus, unlike many approaches to Bayesian inference, precise 
knowledge of the prior and likelihood distributions is not needed, once samples 
are obtained. The following are typical situations where the KBR approach is 
advantageous: 

• The probabilistic relation among variables is difficult to realize with a 
simple parametric model, while we can obtain samples of the variables 
easily. We will see such an example in Section 4.3. 

• The probability density function of the prior and/or likelihood is hard to 
obtain explicitly, but sampling is possible: 

— In the field of population genetics, Bayesian inference is used with 
a likelihood expressed by branching processes to model the split of 
species, for which the explicit density is hard to obtain. Approximate 
Bayesian Computation (ABC) is a popular method for approximately 
sampling from a posterior without knowing the functional form [23, 
29, 39]. 

— Another interesting application along these lines is nonparamctric 
Bayesian inference ([26] and references therein), in which the prior is 
typically given in the form of a process without a density form. In this 
case, sampling methods arc often applied ([21, 22, 41] among others). 
Alternatively, the posterior may be approximated using variational 
methods [4]. 

We will present an experimental comparison of KBR and ABC in Section 
5.2. 

• Even if explicit forms for the likelihood and prior are available, and stan- 
dard sampling methods such as MCMC or sequential MC are applicable, 
the computation of a posterior estimate given y might still be compu- 
tationally costly, making real-time applications unfeasible. Using KBR, 
however, the expectation of a function of the posterior given different y is 
obtained simply by taking the inner product as in Eq. (15), once f x Rx\Y 
has been computed. 
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4.2 Discussions concerning implementation 

When implementing KBR, a number of factors should be borne in mind to en- 
sure good performance. First, in common with many nonparametric approaches, 
KBR requires training data in the region of the new "test" points for results to 
be meaningful. In other words, if the point on which we condition appears in a 
region far from the sample used for the estimation, the posterior estimator will 
be unreliable. 

Second, in computing the posterior in KBR, Gram matrix inversion is neces- 
sary, which would cost 0(n 3 ) for sample size n if attempted directly. Substantial 
cost reductions can be achieved if the Gram matrices are approximated by low 
rank matrix approximations. A popular choice is the incomplete Cholesky de- 
composition [9], which approximates a Gram matrix in the form of IT T with 
n x r matrix r (r <C n) at cost 0(nr 2 ). Using this and the Woodbury identity, 
the KBR can be approximately computed at cost 0(nr 2 ). 

Third, kernel choice or model selection is key to the effectiveness of any kernel 
method. In the case of KBR, we have three model parameters: the kernel (or its 
parameter, e.g. the bandwidth), the regularization parameter e„, and S n . The 
strategy for parameter selection depends on how the posterior is to be used in 
the inference problem. If it is to be applied in regression, we can use standard 
cross-validation. In the filtering experiments in Section 5, we use a validation 
method where we divide the training sample in two. 

A more general model selection approach can also be formulated, by creating 
a new regression problem for the purpose. Suppose the prior II is given by the 
marginal Px of P. The posterior Qx\ v averaged with respect to Py is then equal 
to the marginal Px itself. We are thus able to compare the discrepancy of the 
empirical kernel mean of Px and the average of the estimators fnQ xly=Y - over Yi. 
This leads to a A-fold cross validation approach: for a partition of {1, . . . ,n} 
into K disjoint subsets {T a }^ =1 , let ffig^ be the kernel mean of posterior 
computed using Gram matrices on data {(Aj, Y^)}^-^, and based on the prior 
mean m x a ^ with data {Aj}^^. We can then cross validate by minimizing 

Ef=i|li^| Ej £ t ^qTL^. - ™xll«*> where ™x = yki E iG T **(•. x i)- 

4.3 Application to nonparametric state-space model 

We next describe how KBR may be used in a particular application: namely, 
inference in a general time invariant state-space model, 

T T-l 

p(X,Y) =tt(Xi) l[[p(Y t \X t ) Y[ q(X t+1 \X t ), 
t=i t=i 

where Y t is an observable variable, and A t is a hidden state variable. We begin 
with a brief review of alternative strategics for inference in state-space models 
with complex dynamics, for which linear models are not suitable. The extended 
Kalman filter (EKF) and unscented Kalman filter (UKF, [19]) are nonlinear 



12 



extensions of the standard linear Kalman filter, and are well established in this 
setting. Alternatively, nonparametric estimates of conditional density functions 
can be employed, including kernel density estimation or distribution estimates 
on a partitioning of the space [25, 40]. The latter nonparametric approaches are 
effective only for low-dimensional cases, however. Most relevant to this paper 
are [31] and [33], in which the kernel means and covariance operators are used 
to implement the nonparametric HMM. 

In this paper, we apply the KBR for inference in the nonparametric state- 
space model. We do not assume the conditional probabilities p(Y t \X t ) and 
q(X t +i\X t ) to be known explicitly, nor do we estimate them with simple para- 
metric models. Rather, we assume a sample (Xi,Yi), . . . , (Xp+i, Yr+i) is given 
for both the observable and hidden variables in the training phase. The condi- 
tional probability for observation process p{y\x) and the transition q{xt+\\xt) 
arc represented by the empirical covariance operators as computed on the train- 
ing sample, 

1 T 1 T 

Cxy = -J2 k x(-,X l )<E)ky(-,Y i ), C x+1 x =-J2 k x(;X i+ i)^k x {;Xi), 

i=l i=l 

(17) 

T T 

Cyy = ^J2ky(-,Y i )^k y (- ) Y i ), Cxx = j l Y J k x {-,X i )®k x (-,X i ). 
1=1 1=1 

While the sample is not i.i.d., we can use the empirical covariances, which 
are consistent by the mixing property of Markov models. 

Typical applications of the state-space model are filtering, prediction, and 
smoothing, which are defined by the estimation of p(x s \y\, . . . ,yt) for s = t, 
s > t, and s < t, respectively. Using the KBR, any of these can be computed. 
For simplicity we explain the filtering problem in this paper, but the remaining 
cases are similar. In filtering, given new observations y\,...,yt, we wish to 
estimate the current hidden state Xt- The sequential estimate for the kernel 
mean of p(xt\yx, ■ ■ ■ ,yt) can be derived via KBR. Suppose we already have an 
estimator of the kernel mean of p(xt\yi, ■ ■ ■ ,ijt) in the form 

T 

fh Xt \y lt ... t y t =^2af ) k x {- 1 X i ), 

i=l 

where oq' = ot\ (yi, ■ ■ ■ , yt) are the coefficients at time t. 

From p(x t+ i\yi,...,y t ) = J p(x t+ i\x t )p(x t \yi, . . . , yt)dx t , Theorem 3.2 tells 
us the kernel mean of Xt+i given yi,-.-,yt is estimated by fh Xt+ i\yi,...,yt = 
Cx^xiCxx+eTl)- 1 ™^,...,^ = k^iGx+TETlT^Gxa^Kwhere'k^ = 
(k x (-,X 2 ),..., k x (-,X T+ i)). Applying Theorem 3.2 again with p(y t +i\yi, ■ ■ ■ , yt) = 
J p(yt+i\xt+i)p(xt+i\y\, ■ ■ ■ ,jjt)dxt, we have an estimate for the kernel mean of 
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the prediction p{y t +i \yi, . . . , yt), 

T 

fhyt +1 \m,...,y t = C Y x{C'xx + £Tl)~ 1 m Xt+1 \y u ,,,,y t = y^/xf +1 ky(-,Yj), 

i=l 

where the coefficients fi,( t+1 ^ = (juj i+1 -')?L 1 are given by 

= (G x +Te T I T y 1 G xx+1 (Gx +Te T lT)~ 1 G x a ( - t l (18) 

Here Gxx +1 is the "transfer" matrix defined by {Gxx +1 ) ^ — kx(Xi, Xj+i). 

From n(r t ^i \m ?7+ . i 1 — p(i/t+i|at+i)p(st+i|a/i, ■■-.&) kernel Bavcs' rule 

±<romp^ t+ i|yi,...,y t+ ij - j- p{ y t+1 \ Xt+1 ) p{xt+1 \y u ... t y t)dxt+1 > Kernel caycs ruic 

with the prior p(xt+i\yi, ■ ■ . , yt) and the likelihood p(y t+ i|x t+ i) yields 

a (t+i) = A (Hi) GF ( (A (i+i) Gy) 2 + (5r/r )- 1 A( t + 1 )ky(y t+1 ), (19) 

where A< t+1 ) = diag(/lf +1) , . . . ,/I^ +1) ). Eqs. (18) and (19) describe the update 
rule of a {t \yi,...,y t ). 

If the prior 7r(xi) is available, the posterior estimate at x\ given yi is obtained 
by the kernel Bayes' rule. If not, we may use Eq. (10) to get an initial estimate 
Cxy(Cyy +e n I)- 1 ky(-,y 1 ), yielding a«(yi) = T(G Y + TetIt)' 1 ^ (ft). 

In sequential filtering, a substantial reduction in computational cost can 
be achieved by low rank matrix approximations, as discussed above. Given 
an approximation of rank r for the Gram matrices and transfer matrix, and 
employing the Woodbury identity, the computation costs just 0(Tr 2 ) for each 
time step. 

4.4 Bayesian computation without likelihood 

We next address the setting where the likelihood is not known in analytic form, 
but sampling is possible. In this case, Approximate Bayesian Computation 
(ABC) is a popular method for Bayesian inference. The simplest form of ABC, 
which is called the rejection method, generates a sample from q(Z\W = y) as 
follows: (i) generate a sample X t from the prior II, (ii) generate a sample Y t 
from P(Y\X t ), (iii) if D(y,Y t ) < r, accept X t ; otherwise reject, (iv) go to (i). 
In step (iii), I? is a distance measure of the space X, and r is tolerance to 
acceptance. 

In the same setting as ABC, KBR gives the following sampling-based method 
for computing the kernel posterior mean: 

1. Generate a sample X±, . . . , X n from the prior II. 

2. Generate a sample Y t from P(Y\X t ) (t — 1, . . . , n). 

3. Compute Gram matrices Gx and Gy with (X±, Y±), . . . , (X n , Y n ), and 
Rx\ Y k Y {y)- 
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Alternatively, since (X t , Y t ) is an sample from Q, it is possible to use Eq. (10) for 
the kernel mean of the conditional probability q(x\y). As in [31], the estimator 
is given by 

n 

J2vjkx(;X t ), v = (G Y + Ne N I N )- 1 k y (y). 
t=i 

The distribution of a sample generated by ABC approaches to the true pos- 
terior if r goes to zero, while empirical estimates via the kernel approaches 
converge to the true posterior mean in the limit of infinite sample size. The 
efficiency of ABC, however, can be arbitrarily poor for small r, since a sample 
X t is then rarely accepted in Step (iii). 

The ABC method generates a sample, hence any statistics based on the 
posterior can be approximated. Given a posterior mean obtained by one of 
the kernel methods, however, we may only obtain expectations of functions in 
the RKHS, meaning that certain statistics (such as confidence intervals) are not 
straightforward to obtain. In Section 5.2, we present an experimental evaluation 
of the trade-off between computation time and accuracy for ABC and KBR. 



5 Numerical Examples 

5.1 Nonparametric inference of posterior 

The first numerical example is a comparison between KBR and a kernel density 
estimation (KDE) approach to obtaining conditional densities. Let (Xl, Yi), . . . , (X n , Y n ) 
be an i.i.d. sample from P on R d x M. r . With probability density functions K x (x) 
on M. d and K y (y) on K r , the conditional probability density function p(y\x) is 
estimated by 

TUK^x-X,) 

where K*Jx) = h- d K x {x/h x ) and K% Y (x) = V A ' y fo/M (hx,hy > 0). 
Given an i.i.d. sample U\,...,Ut from the prior II, the particle representation of 
the posterior can be obtained by importance weighting (IW). Using this scheme, 
the posterior q(x\y) given y S M r is represented by the weighted sample ({/,-, Q) 

with^ =Ptv\u i )/i5 =1 mu j ). 

We compare the estimates of / xq{x\y)dx obtained by KBR and KDE + IW, 
using Gaussian kernels for both the methods. Note that the function f(x) = x 
does not belong to the Gaussian kernel RKHS, and the consistency of KBR 
is not rigorously guaranteed for this function (c.f. Theorem 3.5). That said, 
Gaussian kernels are known to be able to approximate any continuous function 
on a compact subset of the Euclidean space with arbitrary accuracy [37] . With 
such kernels, we can expect the posterior mean to be approximated with high 
accuracy on any compact set, and thus on average. In our experiments, the 
dimensionality was given by r — d ranging from 2 to 64. The distribution P of 
(X, Y) was N((0, 1 l) T , V) with V = A T A + 2I d , where l d = (1, . . . , 1) T £ R d 
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KBR vs KDE+IW (E[XIY=y]) 




2 4 8 12 16 24 32 48 64 
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Figure 2: Comparison between KBR and KDE+IW. 



and each component of A was randomly generated as N(Q, 1) for each run. The 
prior II was Px = N(0, Vxx/2), where Vxx is the A-component of V. The 
sample sizes were n = £ = 200. The bandwidth parameters hx,hy in KDE 
were set hx = hy, and chosen over the set {2 * i | i = 1, . . . , 10} in two ways: 
least square cross-validation [5, 27] and the best mean performance. For the 
KBR, we chose a in e -ll*-*'H A 3 * 8 ) in two ways: the median over the pairwisc 
distances in the data [16], and the 10- fold cross-validation approach described 
in Section 4.1. Figure 2 shows the mean square errors (MSE) of the estimates 
over 1000 random points y ~ N(0,Vyy)- KBR significantly outperforms the 
KDE+IW approach. Unsurprisingly, the MSE of both methods increases with 
dimensionality. 

5.2 Bayesian computation without likelihood 

We compare ABC and the kernel methods, KBR and conditional mean, in terms 
of estimation accuracy and computational time, since they have an obvious 
tradeoff. To compute the estimation accuracy rigorously, the ground truth is 
needed: thus we use Gaussian distributions for the true prior and likelihood, 
which makes the posterior easy to compute in closed form. The samples are 
taken from the same model used in Section 5.1, and J xq{x\y)dx is evaluated at 
10 different points of y. We performed 10 random runs with different random 
generation of the true distributions. 

For ABC, we used only the rejection method; while there are more advanced 
sampling schemes [23, 29], their implementation is dependent on the problem 
being solved. Various values for the acceptance region r are used, and the ac- 
curacy and computational time are shown in Fig. 3 together with total sizes 
of the generated samples. For the kernel methods, the sample size n is varied. 
The regularization parameters are given by e n = 0.01/ro and 5 n = 2s n for KBR, 
and e n = 0.01/-y/n for the conditional kernel mean. The kernels in the kernel 
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Figure 3: Comparison of estimation accuracy and computational time with 
KBR and ABC for Baycsian computation without likelihood. The numbers at 
the marks are the sample sizes generated for computation. 



methods are Gaussian kernels for which the bandwidth parameters are chosen 
by the median of the pairwise distances on the data ([16]). The incomplete 
Cholcsky decomposition is employed for the low-rank approximation. The re- 
sults indicate that kernel methods achieve more accurate results than ABC at a 
given computational cost, and the conditional kernel mean shows better results. 



5.3 Filtering problems 

We next compare the KBR filtering method (proposed in Section 4.3) with EKF 
and UKF on synthetic data. 

KBR has the regularization parameters St,St, and kernel parameters for 
kx and ky (e.g., the bandwidth parameter for an RBF kernel). Under the 
assumption that a training sample is available, cross-validation can be performed 
on the training sample to select the parameters. By dividing the training sample 
into two, one half is used to estimate the covariance operators Eq. (17) with a 
candidate parameter set, and the other half to evaluate the estimation errors. 
To reduce the search space and attendant computational cost, we used a simpler 
procedure, setting 5t — 2et, and using the Gaussian kernel bandwidths f3ax 
and Pay , where ax and ay are the median of pairwise distances in the training 
samples ([16]). This leaves only two parameters /3 and Et to be tuned. 

We applied the KBR filtering algorithm from Section 4.3 to two synthetic 
data sets: a simple nonlinear dynamical system, in which the degree of nonlin- 
earity can be controlled, and the problem of camera orientation recovery from 
an image sequence. In the first case, the hidden state is X t = (ut,Vt) T G M 2 , 
and the dynamics are given by 

(l t+1 ) = (1 + bsm(M9 t+1 )) (l™ 9 ^ 1 ) + Ct, t+1 =9 t+V (mod 2tt), 

where r\ > is an increment of the angle and £t ~ A(0, cr 2 ^) is independent 
process noise. Note that the dynamics of (ut,vt) are nonlinear even for 6 = 0. 
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Figure 4: Comparisons with the KBR Filter and EKF. (Average MSEs and 
standard errors over 30 runs.) 




Figure 5: Example of data (b) (X t , N = 300) 

The observation Y t follows 

Y t = {u u vtf + &~JV(0,c#), 

where is independent noise. The two dynamics are defined as follows, (a) 
(rotation with noisy observation) r\ = 0.3, 6 = 0, ah = a a = 0.2. (b) (oscillatory 
rotation with noisy observation) r\ = 0.4, b = 0.4, M = 8, ah = a = 0.2. (See 
Fig.5). 

We assume the correct dynamics are known to the EKF and UKF. The 
results arc shown in Fig. 4. In all the cases, EKF and UKF show unrecognizably 
small difference. The dynamics in (a) are weakly nonlinear, and KBR has 
slightly worse MSE than EKF and UKF. For dataset (b), which has strong 
nonlinearity, KBR outperforms the nonlinear Kalman filter for T > 200. 

In our second synthetic example, we applied the KBR filter to the camera 
rotation problem used in Song et al. [31]. The angle of a camera, which is 
located at a fixed position, is a hidden variable, and movie frames recorded by 
the camera are observed. The data are generated virtually using a computer 
graphics environment. As in [31], we are given 3600 downsampled frames of 
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KBR (Gauss) KBR (Tr) 


Kalman (9 dim.) Kalman (Quat.) 


a 2 = 10- 4 
a 2 = lO" 3 


0.210 ±0.015 0.146 ±0.003 
0.222 ±0.009 0.210 ±0.008 


1.980 ± 0.083 0.557 ±0.023 
1.935 ± 0.064 0.541 ±0.022 



Table 1: Average MSE and standard errors of estimating camera angles (10 
runs). 



20 x 20 RGB pixels (Y t g [0, l] 1200 ), where the first 1800 frames are used for 
training, and the second half are used to test the filter. We make the data noisy 
by adding Gaussian noise N(0,a 2 ) to Y t . 

Our experiments cover two settings. In the first, we assume we do not know 
that the hidden state St is included in 5*0(3), but only that it is a general 
3x3 matrix. In this case, we use the Kalman filter by estimating the relations 
under a linear assumption, and the KBR filter with Gaussian kernels for St 
and X t as Euclidean vectors. In the second setting, we exploit the fact that 
St € 50(3): for the Kalman Filter, St is represented by a quanternion, which 
is a standard vector representation of rotations; for the KBR filter the kernel 
k(A,B) = Tt[AB t ] is used for S t , and S t is estimated within 50(3) . Table 1 
shows the Frobcnius norms between the estimated matrix and the true one. The 
KBR filter significantly outperforms the EKF, since KBR has the advantage in 
extracting the complex nonlinear dependence between the observation and the 
hidden state. 



6 Proofs 

The proof idea for the consistency rates of the KBR estimators is similar to 
[6, 30], in which the basic techniques are taken from the general theory of 
regularization [8]. 

The first preliminary result is a rate of convergence for the mean transition 
in Theorem 3.2. In the following lZ(C X x) means Hx- 

Theorem 6.1. Assume that ir/px € 1Z(C XX ) for some (3 > 0, where ir and px 
are the p. d. f. of II and Px , respectively. Let rh^ be an estimator of mn such 
that ||mjj — TnYi\\n x = O p (n~ a ) as n — > oo for some < a < 1/2. Then, with 
£ n = n~ max{ 5 < *>T+V} ) we have 

W&^x + enl)- 1 ^ -m Qy \\ Hy =O p (n- mi ^««>) ) (rw oo). 

Proof. Take n <S Hx such that ir/px = C xx n. Then, we have 

m n = [ k x {;x)^-px{x)dvx{x) = C&V (20) 

First we show the rate of the estimation error: 

\\CyI(Cxx +e n l)~ 1 m < £ ) - C Y x{C X x + e„/) _1 m n || w = O p (n-%; 1/2 ), 

(21) 
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as n — > oo. By using B 1 — A 1 = B 1 (A — B)A 1 for any invertible operators 
A and B, the left hand side of Eq. (21) is upper bounded by 

\\C^(C^+s n l)-\m^-mu)\\ n ^^ 

+ ll^rxC^OfJC + £ ™-0 ipxx — @xx) ipxx + £rJ) m n\\ ny - 

By the decomposition C YX = C^^W^C^ 2 with \\W$$\\ < 1 [2], we 
have \\G'yx(Cxx + e «-0 *l = Oj>( £ n )i which implies the first term is of 
O p (n~ a e„ )■ From the y^n consistency of the covariance operators and mn = 
Cy^rj, a similar argument to the first term proves that the second and third 
terms are of the order O p (n -1 / 2 ) and O p (n _1 / 2 e„ 1 ^ 2 ), respectively, which means 
Eq. (21). 

Next, we show the rate for the approximation error 
WCYxiCxx+Sniy'mn-mQyW^ = ( £ ™{(i+2/3)Ai}) („_►«,). ( 22 ) 

Let Cyx = Cyy^WyxCxx ^ e the decomposition with ||Wrx|| < 1. It follows 
from Eq. (20) and the relation 

m Qy = [ [ H^y)—^r\P( x iy) d,y x(x)diyy(y)^CYxC xx T] 
J J Pxyx) 

that the left hand side of Eq. (22) is upper bounded by 

iicy> yx ||||(^x+ £n /)- 1 4T 3) ^-« +1)/2 '7ii^- 

By the eigendecomposition C'xx = 'Yln^i4>i{4>i: ') > where are the positive 
eigenvalues and {4>i} are the corresponding unit eigenvectors, the expansion 



(2j8+l)/2s 2 



Wlr _i_ f~ n^V( 2/3+3 )/ 2 ^ ^( 2/3+1 )/ 2 ^n 2 _X^( £nX i \ /„ a 

\\{Cxx+e n l) C xx r)-L xx V\\ Hx - A . +£ J W'<« 

E A (2/3 + l)/2 A (2/3+l)/2 £ (l-2/3)/2 (20 + 1) /2 

holds. If < j8 < 1/2, we have " A ' i+£n = ^^mm (a.+L)' 1 - 2 ^/ 2 g " - 

,,01,1;, _ x (2|3+l)/2 

eJr +1)/2 . If (3 > 1/2, then ""^. +£n < \\C X x\\e n - The dominated conver- 
gence theorem shows that the the above sum converges to zero of the order 

0(e min W +l,2} ) ag ^ ^ 

From Eqs. (21) and (22), the optimal order of e n and the optimal rate of 
consistency are given as claimed. □ 

The following theorem shows the consistency rate of the estimator used in 
the conditioning step Eq. (11). 

Theorem 6.2. Let f be a function in Hx, and (Z,W) be a random variable 
taking values in X x y. Assume that E[f(Z)\W = •] G lZ(Cyyw) f or some 
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v > 0, and Cyy Z : Hx — > Hy and Cyfyy : Hy — > Hy be compact operators, 
which may not be positive definite, such that \\Cy^ z — Cwz\\ = O p (n f ) and 
W^ww ~ Cw'H'll = O p (n r ) for some 7 > 0. Then, for a positive sequence 
5„ = n ' nla ^' s ' 1, ^ 7 + sJ \ we have as n -> 00 

Proof. Let r] e Ha- such that E[f(Z)\W = ■} = C^ w fj. First we show 

ll^(Pw) 2 + &n.l) @wzf ~ Cww(C ww + S n I)~ 1 Cwzf\\ nx 

= O p (n-^" 5 / 4 ). (23) 

The left hand side of Eq. (23) is upper bounded by 
+ IK^ww - Cww)(C ww + S n iy 1 Cwzf\\ ny 

+ llCwwiiCyyly) 2 + 5 n l) ( (C^yy ) 2 — ) (Cyyyy + S n l) Cwzf\\ Hy - 

Let Cy/yy = Yl,i^i < t ) i{4>i, ') be the eigendecomposition, where {</>i} is the unit 
eigenvectors and {A^} is the corresponding eigenvalues. From |Ai/(Af + 8 n )\ = 
1/|A; + 6 n /\i\ < 1/(2^^) = 1/(2^), we have \\C^ W ((C^ W ) 2 + 

$nl) 1 || < l/(2\/3^"), and thus the first term of the above bound is of Oj,(n^ 7 (5„ 1 ^ 2 ). 
A similar argument by the eigendecomposition of Cyvw combined with the de- 
composition Cwz = C^yyUw zC zz with ||CAyz|| < 1 shows that the second 
term is of O v (n^ A7 3/4 ). From the fact ||(C^) 2 - C ww \\ < 0^(0^ - 

Cww)\\ + \\(C{y W -Cww)Cww\\ = O p (n~"<), the third term is of O p (n~~<5n 5 ^ 4 ). 
This implies Eq. (23). 

From E[f{Z)\W = •] = C wwV and C wz f = C ww E[f{Z)\W = ■} = 
Cy^yyT], the convergence rate 

\\Cww(C ww + Kiy X C W zf - E[f(Z)\W = .]\\ Hy = 0(C n{1,|} ). (24) 

can be proved by the same way as Eq. (22). 

Combination of Eqs.(23) and (24) proves the assertion. □ 

Recall that Cww is the integral operator on L 2 (Qy) defined by Eq. (16). 
The following theorem shows the consistency rate on average. Here TZ(Cyy W ) 
means L 2 (Qy). 

Theorem 6.3. Let f be a function in Hx, and (Z,W) be a random variable 
taking values in X x y with distribution Q. Assume that E[f(Z)\W = •] G 
TZ{Cyy W ) H Hy for some v > 0, and Cffi z : Hx — > Hy and Cyyyy : Hy — > Hy 
be compact operators, which may not be positive definite, such that \\Cyy Z — 
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Cwz\\ = O p (n 7 ) and \\C^ W — Cww\\ = O p (n 7 ) for some 7 > 0. Then, for 

r 1 2 1 

a positive sequence S n = n~ ^ 2^77+27^ we as n — > 00 

II^CC^)^/)" 1 ^/-^^)!^ = -]|| L2(Qy) = O p (n- min {H^>). 

Proo/. Note that for f,g€H x weh a ve(f,g) L2(Qy) =E[f{W)g(W)] = (f,C ww g) Hx . 
It follows that the left hand side of the assertion is equal to 



\cUwC$ w {(C$ w f + S^y'd^f - cU 2 w E[f(Z)\W 



I "Hi 



First, by the similar argument to the proof of Eq. (23), it is easy to show 
that the rate of the estimation error is given by 

||^ww{^ww((^wiv) 2 + ^nl) Cy/z^ ~ Cww{C ww + S r jy 1 Cwzf}\\ ny 

= O p (n^8- 1 ). 

It suffices then to prove 

\\Cww(c ww + s^jy'Cwzf - E[f(z)\w = •]|| ia(0y) - o(C in{1,f} ). 

Let £ e L 2 (Qy) such that E[f(Z)\W = ■] = C ww £. In a similar way to 
Theorem 3.1, Cww E[f (Z)\W\ = Cwzf holds, where Cwz is the extension of 
Cwz, and thus Cwzf = C'w'vv-C- The left hand side of the above equation is 
equal to 

jjCWw^CV^/ + 5 n I) C w ^ w ^ — Cww^\\l 2 (q c y)' 

By the eigendecomposition of Cww in L 2 {Qy), a similar argument to the proof 
of Eq. (24) shows the assertion. □ 

The consistency of KBR follows by combining the above theorems. 

Theorem 6.4. Let f be a function in T~ix, (Z,W) be a random variable that 
has the distribution Q with p.d.f. p{y\x)ir{x), and mj] 1 be an estimator of mn 
such that \\rh^ — mu\\u x = O p (n~ a ) (n — > ooj for some < a < 1/2. As- 
sume that n/px e K{C XX ) with (3 > 0, and E[f{Z)\W = •] € Tl{C ww ) 
for some v > 0. For the regularization constants e„ = n~~ max 't 3 a 'T+? Q l an d 
5 n = j7 _max 'i3 7 '2^+5Tj' ! where 7 = min{|a, 09+2 a }> we have for any y E y 

ilR x \ Y ^ Y {y) - E[f(Z)\W = y} = Opjn-^*^), (n 00), 
where f^i?x|yky(y) zs given by Eq. (14)- 

Proof. By applying Theorem 6.1 to Y = (Y,X) and Y = (Y,Y), we see that 
both of \\Cwz — Cwz\\ and \\Cww — Cww\\ are of O p (n -7 ). Since 

f£ifc|yky(y)-.E7[/(^|W = y] 

= (ky(;y),C W w((C Y Y) 2 + Sniy'Cwzf - E[f(Z)\W = -Vuy, 

combination of Theorems 6.1 and 6.2 proves the theorem. □ 



22 



The next theorem shows the rate on average w.r.t. Qy. The proof is similar 
to the above theorem, and omitted. 

Theorem 6.5. Let f be a function in T-Lx, (Z,W) be a random variable that 
has the distribution Q with p.d.f. p(y\x)ir(x), and mj] be an estimator o/mn 
such that — m u\\ u x = O p (n~ a ) (n — > ooj for some < a < 1/2. Assume 

that n/px £ n(C f x X ) with /3 > 0, and E[f{Z)\W = ■] £ TZ(C^ W ) n Hy 
for some v > 0. For the regularization constants e n = n~ max "t ~s a 'TTW a ' and 
5 n = n _max {2T>iT2"'} j where 7 = min{|o;, ^ a}, we /laue 

||f£ifc|yk y (W) - E[/(Z)|W]|| i2(Qy) = O p (n~™<^>^), (n H- 00). 

We also have consistency of the estimator for the kernel mean of posterior 
m Qx\ y > if we ma ke stronger assumptions. First, we formulate the expectation 
with the posterior in terms of operators. Let (Z, W) be a random variable 
with distribution Q. Assume that for any / £ Hx the conditional expectation 
E[f(Z)\W = •] is included in Hy. We then have a linear operator S defined by 

S:Hx^Hy, f H- E[f(Z)\W = •]. 

If we further assume that S is bounded, the adjoint operator S* : H.y — > %x 
satisfies 

(S*k y (;y),f) nx ={ky(; y),Sf) Hy =E[f(Z)\W = y] 

for any y £ y, and thus S*ky{-, y) is equal to the kernel mean of the conditional 
probability of Z given W = y. 

We make the following further assumptions: 
Assumption (S) 

1. The covariance operator Cww is injective. 

2. There exists v > such that for any / £ Ux there is 77/ £ Hx with 
Sf = Cyyw 7 !!' ana - the- linear map 

C ww S '■ Ux -> Hy, / i-> 17/ 

is bounded. 

Theorem 6.6. Let (Z, W) be a random variable that has the distribution Q 
with p.d.f. p(y\x)n(x), and ffiqj be an estimator of mn such that — 
mnll-H* = O p (n~ a ) (n — > 00) for some < a < 1/2. Assume (S) above, 
and tt/px £ ^-i^xx) some /3 > 0. For the regularization constants 

e n = n - m ^ { i a '^ a} and S n = n - max{ ^<^^ } , where 7 = min{|a, ff±§a}, 
we have for any y £ y 

\\k x Rx\YkY(y)-m Qxly \\ nx =O p (n-^h^-<y) : 

as)i-> 00, where rnQ x \ y is the kernel mean of the posterior given y. 
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Proof. First, in a similar manner to the proof of Eq. (23), we have 

||^((^) 2 +M~ 1 ^U^(^)-^w(C^+5 n /)- 1 C w fc i ,(-,y)||^ 

= o>-^„- 5/4 )- 

The assertion is thus obtained if 

\\C ZW (C WW + 5 n I)- 1 C ww k y (;y) - S*ky(-,y)\\ nx = 0(C n{1 ^ } ) (25) 
is proved. The left hand side of Eq. (25) is upper-bounded by 

\\Czw{C ww + S„I) 1 Cww — S*\\ \\ky{-, y)\\-Hy 

= \\Cww(C ww + S n I) 1 Cwz - S\\ \\ky(-,y)\\n y - 

It follows from Theorem 3.1 that Cwz = CwwS, and thus \\Cww(C ww + 
6 n I)~ 1 Cwz-S\\ = \\Cww(C ww +S n I)^ 1 CwwS-S\\ < 5 n \\ (C'^ vw +5 n I)~ 1 C ww \\ ||C w i ^ / S'||. 
The eigendecomposition of Cww together with the inequality £p* s < §^ ln i 1 -"/ 2 } 
(A > 0) completes the proof. □ 
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